Spellchecker api #3

Draft
wants to merge 14 commits into
base: master

Conversation

nthykier
Owner

No description provided.

nthykier and others added 14 commits May 25, 2024 07:38
No new code is introduced; only existing code is shuffled around and
the functions moved are unchanged as well.
Previously, when the spelling dictionaries were loaded, the correction
line was simply stored in memory as text. Throughout the code, callers
then had to deal with the `data` attribute and correctly `split()` +
`strip()` it. With this change, the dictionary parsing code
encapsulates this concern.
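As a rough illustration of the parsing described above (the names `Misspelling`, `candidates`, and the helper are assumptions, not the actual codespell implementation), the `split()` + `strip()` work can happen once at load time:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Misspelling:
    # Candidates are parsed once at load time instead of being kept
    # as a raw text blob (the old `data` attribute).
    candidates: List[str]


def parse_correction_text(raw: str) -> Misspelling:
    # Minimal sketch: split and strip the raw correction text once,
    # so callers never have to touch the raw string again.
    candidates = [c.strip() for c in raw.strip().split(",") if c.strip()]
    return Misspelling(candidates=candidates)
```

Callers can then iterate `candidates` directly instead of re-parsing the same string at every lookup.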

The auto-correction works from the assumption that there is only one
candidate. This assumption is an invariant and seems to be properly
maintained in the code. Therefore, we can just pick the first
candidate word when doing a correction.

In the code, the following name changes are performed:

 * `Misspelling.data` -> `Misspelling.candidates`
 * `fixword` -> `candidates` when used for multiple candidates
   (`fixword` remains for when it is a correction)
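The renaming above can be sketched with a hypothetical helper (the name `format_candidates` is an assumption; it merely illustrates the single-candidate invariant, not the actual codespell code):

```python
def format_candidates(candidates):
    # With the single-candidate invariant, a correction (`fixword`) is
    # simply the first -- and only -- entry. Multiple candidates are
    # joined for display as suggestions instead of auto-corrected.
    if len(candidates) == 1:
        return candidates[0]          # auto-correction: the fixword itself
    return ", ".join(candidates)      # suggestions only, no auto-fix
```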

On performance:

Performance-wise, this change moves computation from "checking" time
to "startup" time.  The performance cost does not appear to be
noticeable in my baseline (codespell-project#3419). Though keep in
mind the corpus weakness regarding the ratio of cased vs. non-cased
corrections with multiple candidates.

The all-lowercase typo is now slightly more expensive (it was passed
through `fix_case` and fed directly into the `print` in the original
code; in the new code, it will always need a `join`).  There is still
a preponderance of lowercase-only corrections in general, so the
unconditional `.join` alone is not enough to affect the performance
noticeably.
This is as close to a 1:1 conversion as possible. It might change
when we get to designing the API. The callers have been refactored to
only perform the lookup once; this was mostly to keep the code more
readable.

The performance cost does seem noticeable, which is unsurprising. This
method has a higher cost for non-matches, which are the most common
case.  This commit causes performance to drop by roughly 10% on its
own, and we are now slower than the goal.
The refactor is a stepping stone towards the next commit where the
inner loop is moved to the `Spellchecker`.
With this rewrite, performance improved slightly and is now down to 7%
slower than the baseline (6s vs. 5.6s).

There is deliberately an over-indentation left in this commit, since
that makes the commit easier to review (without ignoring whitespace
changes).
Deliberately kept in a separate commit. There are no functional
changes, but there are some reformatting changes (line merges) as a
consequence of the de-indent.
The `Spellchecker` only needs the `group` method from the `re.Match`.
With a bit of generics and typing protocols, we can make the
`Spellchecker` work with any token type that has a `group()` method.

The `codespell` command line tool still assumes `re.Match` but it can
get that via its own line tokenizer, so it all works out for everyone.
The new API has introduced extra overhead per line being spellchecked.
One way of optimizing out this overhead, is to spellcheck fewer lines.
An obvious choice here, is to optimize out empty and whitespace-only
lines, since they will not have any typos at all (on account of not
having any words).

A side effect of this change is that we now spellcheck lines with
trailing whitespace stripped. Semantically, this gives the same result
(per "whitespace never has typos"). Performance-wise, it is faster in
theory because the strings are now shorter (and we were calling
`.rstrip()` anyway). In practice, I am not sure we are going to find
any real corpus where the trailing whitespace is noteworthy from a
performance point of view.

On the performance corpus from codespell-project#3491, this takes out
~0.4s of runtime, bringing us down to slightly above the 5.6s that
made the baseline.
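A minimal sketch of the line-skipping optimization (the generator name is an assumption; codespell's actual code is structured differently):

```python
def lines_to_check(lines):
    # Skip empty and whitespace-only lines up front, and hand the
    # spellchecker rstrip'ed lines. Whitespace never has typos, so the
    # result is semantically unchanged while fewer (and shorter) lines
    # reach the expensive tokenization step.
    for lineno, line in enumerate(lines, start=1):
        stripped = line.rstrip()
        if stripped:
            yield lineno, stripped
```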
This makes the API automatically avoid some declared false-positives
that the command line tool would also filter.
The changes to provide a public API had a performance-related cost of
about 1% runtime. There is no trivial way to offset this any further
without undermining the API we are building. However, we can pull
performance-related shenanigans to compensate for the cost introduced.

The codespell codebase unsurprisingly spends a vast majority of its
runtime in various regex related code such as `search` and `finditer`.

The best way to optimize runtime spent in regexes is to not run the
regex in the first place, since the regex engine has a rather steep
overhead compared to plain string primitives (that is the cost of
flexibility). If the regex rarely matches and there is a very simple
static substring that can be used to rule out a match, then you can
speed up the code by using `substring in string` as a conditional to
skip the regex. This assumes the regex is used often enough for the
performance to matter.

An obvious choice here falls on the `codespell:ignore` regex, because
it has a very distinctive substring in the form of `codespell:ignore`,
which will rule out almost all lines that will not match.

With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus
mentioned in codespell-project#3419.
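The trick looks roughly like this (the pattern shown is a simplification; the actual codespell regex for `codespell:ignore` may differ):

```python
import re

# Simplified stand-in for the real ignore pattern.
IGNORE_RE = re.compile(r"codespell:ignore\b")


def has_ignore_marker(line: str):
    # Fast path: `in` on a literal substring is much cheaper than
    # invoking the regex engine, and it rules out almost every line
    # immediately. Only lines containing the marker pay for the regex.
    if "codespell:ignore" not in line:
        return None
    return IGNORE_RE.search(line)
```

Since the substring check and the regex agree on the literal `codespell:ignore`, the fast path can never skip a line the regex would have matched.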
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.4.4 → v0.4.5](astral-sh/ruff-pre-commit@v0.4.4...v0.4.5)
- [github.com/codespell-project/codespell: v2.2.6 → v2.3.0](codespell-project/codespell@v2.2.6...v2.3.0)